A home is often the largest and most expensive purchase a person makes in his or her lifetime. Ensuring homeowners have a trusted way to monitor this asset is incredibly important.
In this competition, students are required to develop a full-fledged approach to predicting the future sale prices of homes. Such an approach consists of, at a minimum, the following steps: loading and cleaning the data, exploring it, selecting the most relevant features, and building and evaluating a predictive model.
Now, should you ask a home buyer to describe their dream house, they probably wouldn't begin with describing features such as the height of the basement ceiling or the proximity to a railroad. As you will see, the dataset we use in this competition proves that many more features influence price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in a small city in the US, this competition challenges you to predict the final price of each home.
Below is a summary of what we are going to do throughout the notebook.
The first thing to do is to retrieve the raw data from disk: we have a training dataset (used to define and train our model) and a test set (on which the model will run, as required by the challenge).
# missing python libraries
!pip install seaborn
!pip install statsmodels
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import re
import pprint
import json
from typing import Iterable, Callable
# load the dataset
dataset = pd.read_csv('challenge_data/train.csv', sep=',', keep_default_na=True, header='infer' )
# load the test set
testset = pd.read_csv('challenge_data/test.csv', sep=',', keep_default_na=True, header='infer' )
# save the Id column for the final result
testset_Ids = testset['Id']
dataset.shape
testset.shape
After a first look at the table, we notice a problem with some features: for them, 'NA' is a real category ('Not Available') rather than a missing value. Pandas treats these entries as NaN, so we need to convert them back where required.
dataset.iloc[:10]
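To see the problem concretely, here is a minimal sketch (using a toy in-memory CSV, not the actual challenge data) of how pandas' default NA handling turns the literal string "NA" into a missing value, and how `fillna` can restore it where "NA" is a documented category:

```python
import io
import pandas as pd

# toy CSV: for 'Alley', "NA" means "no alley access" (a real category),
# but pandas' default NA handling parses it as a missing value (NaN)
csv = io.StringIO("Id,Alley,LotFrontage\n1,NA,65\n2,Pave,NA\n")
df = pd.read_csv(csv, keep_default_na=True)

print(df['Alley'].isna().tolist())   # [True, False] -> the "NA" became NaN
# restore the literal category only where "NA" is a documented value
df = df.fillna({'Alley': 'NA'})
print(df['Alley'].tolist())          # ['NA', 'Pave']
```

Note that the fill is targeted per column: `LotFrontage`'s NaN stays missing, because there "NA" really does mean "value not available".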
The provided text file, "Data Description.rtf", serves as reference documentation for the features in the dataset. As we can read, the available features are both qualitative (categorical data) and quantitative (continuous data): each comes with a short description we can use to understand what the feature actually represents and, if qualitative, what its possible values are.
The next step is to convert this documentation into a JSON file: it can then be used to define integrity rules over the data, and it can be accessed more easily when needed (feature lookup).
The first 20 lines of the rtf file look like this:
! head -n 20 ./challenge_data/Data\ description.rtf
The idea is to remove the first 11 lines, which are not useful, and focus on the feature descriptions only. We also need to remove the last line, which looks like this:
! tail -n 1 ./challenge_data/Data\ description.rtf
Using bash commands and regular expressions, we can easily remove those lines and the '\' characters.
In addition, we want to prepend a '#' character to each feature name, which will later be used to split the text.
! cat ./challenge_data/Data\ description.rtf | tail -n +12 | \
head -n -1 | sed -e 's,\\,,g' | sed -e 's,^\(\w*\:\),#\1,g' > ./challenge_data/feature_descriptions.txt
! head -n 10 ./challenge_data/feature_descriptions.txt
We can now create a Python dictionary from this file and then convert it into JSON format (exploiting the json library). The dictionary will contain a key for each feature, while each value will be another dictionary with the following keys: 'description' (the feature's textual description), 'continuous' (whether the feature is quantitative), and, for categorical features only, 'values' (a nested dictionary mapping each admissible value to its own description and a numerical id).
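As a sketch of the structure we are aiming for (the two features shown here are illustrative examples taken from the data description; the parsing code below builds this shape for all 79 features):

```python
import json

# target structure: each feature maps to a dict with 'description',
# 'continuous', and (for categorical features only) a 'values' dict
features_metadata_example = {
    'Street': {
        'description': 'Type of road access to property',
        'continuous': False,
        'values': {
            'Grvl': {'description': 'Gravel', 'numerical_id': 0},
            'Pave': {'description': 'Paved', 'numerical_id': 1},
        },
    },
    'LotArea': {
        'description': 'Lot size in square feet',
        'continuous': True,
    },
}
print(json.dumps(features_metadata_example, indent=2))
```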
# read the pre-processed feature descriptions file
with open('./challenge_data/feature_descriptions.txt', 'r') as f:
    text = f.read()
# split the file into a list of strings, one per feature (the first element is empty and is discarded)
features_text = text.split("#")[1:]
# sample
print(features_text[5])
features_metadata = {}
for feature_raw in features_text:
# clean the text from some special token combination
feature = (feature_raw
.replace('\t\t', '')
.replace('\n\t\n', '\n\n')
.replace('\n\t\t\n', '\n\n')
.replace('\n \n', '\n\n')
)
# split between the feature name/description and its values (if any)
tmp = feature.split('\n\n')
feature_name = tmp[0]
# continuous features have no list of values after the description
feature_values = tmp[1] if len(tmp) > 1 else ''
# separate the name from the description
key, descr_raw = feature_name.split(':')
# strip the description's leading whitespace
description = re.sub('^ ', '', descr_raw)
# populate the dictionary including the new feature and its description
features_metadata[key] = {
'description' : description,
}
# control if the feature is discrete or not and populate the entry
if feature_values != '':
# include the info in the dictionary
features_metadata[key]['continuous'] = False
features_metadata[key]['values'] = {}
entries_cleaned = feature_values.replace('\t\n', '\n')
# split the values (one per line)
entries_split = [value for value in entries_cleaned.split('\n')]
# filter the entries composed of a string of white spaces only
entries_raw = [x for x in filter(lambda x: re.sub(' +', '', x), entries_split)]
# split the value name from the description
entries_raw = [value_descr.split('\t') for value_descr in entries_raw]
# filter the entries composed of a string of white spaces only
entries = [ [x for x in filter(lambda x: re.sub(' +', '', x), entry_raw)] for entry_raw in entries_raw ]
for indx, entry in enumerate(entries):
# get the value and clean from whitespaces
value = re.sub(' +', '', entry[0])
# prepare the description (some values come without one)
description = entry[1] if len(entry) > 1 else ''
# create the entry with the feature value as key and include the description
features_metadata[key]['values'][value] = {
'description': description,
# we include also a numerical id
'numerical_id': indx
}
else:
features_metadata[key]['continuous'] = True
To better visualize the data structure, let's see the result considering one feature:
pprint.pprint(features_metadata['Condition1'], depth=3)
Are we sure that the dataset's features have the same names as those specified in our documentation? Let us check.
# some features do not appear inside the documentation
no_docs_features = ['Id', 'SalePrice']
# check the number of available features
dataset_features = set(dataset.drop(no_docs_features, axis=1).columns.values)
docs_features = set(features_metadata.keys())
assert len(dataset_features) - len(docs_features) == 0, "Not equal length between the two sets"
# see which features are in the Dataset but not in the Metadata file
dataset_features - docs_features
# see which features are in the Metadata file but not in the Dataset
docs_features - dataset_features
The answer is no: in fact, we can see discrepancies in the names of two features.
It looks like a mere naming mismatch (something like a typo): to be sure, let us check whether the dataset's values match those described in the documentation (or, for quantitative features, are consistent with the feature's description and nature).
# BedroomAbvGr and Bedroom
print("Unique values for 'BedroomAbvGr': {!s}".format(dataset['BedroomAbvGr'].unique()))
print("Description for 'Bedroom': \n {}".format(features_metadata['Bedroom']['description']))
# KitchenAbvGr and Kitchen
print("Unique values for 'KitchenAbvGr': {!s}".format(dataset['KitchenAbvGr'].unique()))
print("Description for 'Kitchen': \n {}".format(features_metadata['Kitchen']['description']))
As we can see, the metadata descriptions closely match the dataset's feature names. We can therefore correct the JSON metadata file (our documentation) to make it compliant with the dataset.
# correct the documentation
features_metadata['BedroomAbvGr'] = features_metadata.pop('Bedroom')
features_metadata['KitchenAbvGr'] = features_metadata.pop('Kitchen')
Now we can save the documentation to disk in JSON format.
# save the JSON docs file
with open('./challenge_data/feature_descritpion.json', 'w') as f:
    f.write(json.dumps(features_metadata))
Given a feature name, we want a function able to present the documentation about that feature in a more user-friendly way.
def PrintFeatureMetadata(feature: str, drop_id_col=True):
"""
Function to properly print all the metadata about the passed feature name.
It prints a dataframe with the values associated with that feature if discrete;
otherwise, only the general information is printed.
Params:
- feature: the name of the feature to be printed
- drop_id_col: keep or not the 'numerical_id' column when printing
"""
print("Feature name: {0!s}".format(feature))
print("--- Continuous: {!s}".format(features_metadata[feature]['continuous']))
print("--- Description: {!s}".format(features_metadata[feature]['description']))
print('\n')
if not features_metadata[feature]['continuous']:
df = pd.DataFrame(data=features_metadata[feature]['values']).T
df['values'] = df.index
# select whether to visualize the numerical id or not
if drop_id_col:
col = ['values', 'description']
else:
col = ['values', 'numerical_id', 'description']
print(
df[col]
.reset_index(drop=True)
.sort_values(['values'], ascending=[1])
)
print('\n')
# an example
PrintFeatureMetadata('Condition1')
As we have seen before, pandas automatically converts "NA" entries into "NaN" (Not a Number) values. The problem is that some fields (categorical string features) actually need to keep "NA" as a value.
In order to spot these fields and fix the problem, we exploit the JSON metadata file we created:
# create the dictionary whose keys are the fields that need to maintain the "NA" value
NaN_dict = {}
for k in features_metadata.keys():
if not features_metadata[k]['continuous'] and 'NA' in features_metadata[k]['values']:
NaN_dict[k] = 'NA'
# convert NaN as 'NA'
dataset = dataset.fillna(NaN_dict)
testset = testset.fillna(NaN_dict)
Let us check which dataset features still present NaN values:
NaN_dataset = dataset.isnull().sum().sort_values(ascending=False)[lambda x: x > 0]
for x in NaN_dataset.index:
PrintFeatureMetadata(x)
sns.barplot(x=NaN_dataset.index, y=NaN_dataset)
_ = plt.xticks(rotation='45')
Our goal is to recover these "NaN" values.
According to the dataset documentation, we can decide to set them as follows: 'MasVnrType' to 'None' (no masonry veneer), and 'GarageYrBlt', 'LotFrontage', and 'MasVnrArea' to 0.
We will also create a boolean "missing_garage" field, so that the model can tell when a house has no garage.
# create the "missing_garage" field (which is not documented)
no_docs_features.append('missing_garage')
dataset['missing_garage'] = dataset['GarageYrBlt'].isna()
# recover NaN values
dataset = dataset.fillna(
{
'MasVnrType': 'None',
'GarageYrBlt': 0,
'LotFrontage': 0,
'MasVnrArea': 0
}
)
We now replicate for the test set what we did before:
NaN_testset = testset.isnull().sum().sort_values(ascending=False)[lambda x: x > 0]
for x in NaN_testset.index:
PrintFeatureMetadata(x)
These fields are the same ones spotted for the dataset, but now we also have "Electrical".
sns.barplot(x=NaN_testset.index, y=NaN_testset)
_ = plt.xticks(rotation='45')
# create the "missing_garage" field
testset['missing_garage'] = testset['GarageYrBlt'].isna()
# recover NaN values
testset = testset.fillna(
{
'MasVnrType': 'None',
'GarageYrBlt': 0,
'LotFrontage': 0,
'MasVnrArea': 0
}
)
In order to build better models, categorical features need to be remapped into dummy variables (or ratings, when possible). In other words, each possible value of a given categorical feature will generate a new feature, and the original one will be dropped.
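As a quick illustration of this remapping, here is a minimal sketch on a toy frame (hypothetical values) using `pd.get_dummies`, the same function applied to the full dataset further below:

```python
import pandas as pd

# hypothetical mini-frame: one categorical feature and one numerical one
df = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'],
                   'LotArea': [8450, 9600, 11250]})

# each category becomes its own boolean column, and the original
# categorical column is dropped; numerical columns pass through untouched
dummies = pd.get_dummies(df)
print(dummies.columns.tolist())  # ['LotArea', 'Street_Grvl', 'Street_Pave']
```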
The class we are going to define acts as a dataset manager: given a dataset (a pandas DataFrame) and optionally its documentation (a Python dictionary retrieved from a JSON file), we can define extra integrity rules (a Python dictionary) for each feature and spot the records that violate them.
In this way, inconsistencies can be spotted and recovered, while unrecoverable errors can be deleted. This class is meant to be a reusable and scalable solution that can be adopted for other datasets of the same nature (so that integrity rules are defined only once), automating part of the data pre-processing.
class DataRules(object):
'''
This class defines a set of rules that can be used to check the integrity of the columns (features) of a dataset
'''
def __init__(self, dataset: pd.core.frame.DataFrame, key_field: str, json_doc: dict):
self._rules = {}
self._dataset = dataset
self._key_field = key_field
self._json_doc = json_doc
def set_dataset(self, dataset: pd.core.frame.DataFrame, json_doc=None):
'''
Assign a specific dataset (and eventually a different documentation) to the dictionary of rules
'''
self._dataset = dataset
if json_doc is not None:
self._json_doc = json_doc
def rules(self):
''' Retrieve the dictionary of rules '''
return self._rules
def data(self):
''' Retrieve the dataset attached to the rules '''
return self._dataset
def docs(self):
''' Retrieve the dataset documentation '''
return self._json_doc
def add_rule(self, for_columns: Iterable, rule: Callable):
'''
Associate a rule to one or more columns, whose names are passed as a list
'''
for col in for_columns:
if col in self._rules:
# some rules are already existing for this column: add this one too
self._rules[col].append(rule)
else:
# no rules have been already defined for this column
self._rules[col] = [rule]
def clear_rules(self):
'''
Delete all the rules
'''
self._rules = {}
def clear_rules_for(self, column_names: Iterable):
'''
Delete all the rules associated to the columns passed as parameter
'''
for col in column_names:
if col in self._rules:
self._rules.pop(col)
def destroy(self):
'''
Destroy this object and free the memory
'''
# Python does not allow you to explicitly free the memory...
# But what is not referenced anymore will be "deleted" automatically by the garbage collector
self._rules = {}
self._dataset = None
self._key_field = None
self._json_doc = None
def spot_wrong_records(self, verbose=False):
'''
Print records of the dataset that do not satisfy the rules.
For simplicity, only the key and the field with the wrong value are printed.
'''
ok = True
# consider every field of the dataframe
for col in self._dataset:
# some columns have no documentation: we need to skip them
if col in ['SalePrice', 'Id', 'missing_garage']:
continue
log = ''
anomalies = list()
# apply user-defined integrity rules, if present
if col in self._rules:
for rule in self._rules[col]:
# retrieve non-compliant records
res = self._dataset.filter(items=[self._key_field, col])[rule(self._dataset[col]) == False]
if not res.empty:
# update result information
log += str(res) + '\n'
anomalies.append(res[col])
# extra check for the categorical fields
if 'values' in self._json_doc[col]:
# some fields are categorical: we need to check if their values are compliant with the documentation
possible_values = set(self._json_doc[col]['values'])
current_values = set(map(str, self._dataset[col].unique())) # JSON keys are always strings
# if there are values which are not described inside the documentation, they must be wrong
irregular_values = current_values - possible_values
if len(irregular_values) > 0:
# the NaN value has to be properly managed
if 'nan' in irregular_values:
irregular_values.add(np.nan)
irregular_values.remove('nan')
# retrieve irregular records
res = self._dataset.filter(items=[self._key_field, col])[self._dataset[col].isin(irregular_values)]
log += str(res) + '\n'
anomalies.append(res[col])
# final results
anomaly_found = log != ''
if anomaly_found:
ok = False
anomalies = sorted(pd.concat(anomalies).unique())
if verbose:
print('\n*** Analysing: "{}" ***'.format(col))
if anomaly_found:
print('Problems found!')
print('Critical values: {}'.format(anomalies))
print(log)
else:
print('No problems found.')
else:
if anomaly_found:
print('Field: {}'.format(col))
print('Wrong values: {}\n'.format(anomalies))
# if everything was ok, print a message
if ok:
print('Everything all right!')
def clean_dataset(self):
'''
Filter records of the dataset that do not satisfy the rules
'''
original_size = len(self._dataset)
# consider every field of the dataframe
for col in self._dataset:
if col in ['SalePrice', 'Id', 'missing_garage']:
continue
# check user-defined integrity rules, if present
if col in self._rules:
for rule in self._rules[col]:
# keep only regular records
self._dataset = self._dataset[rule(self._dataset[col])]
# extra check for the categorical fields
if 'values' in self._json_doc[col]:
# some fields are categorical: we need to check if their values are compliant with the documentation
possible_values = set(self._json_doc[col]['values'])
current_values = set(map(str, self._dataset[col].unique())) # JSON keys are always strings
# if there are values which are not described inside the documentation, they must be wrong
irregular_values = current_values - possible_values
if len(irregular_values) > 0:
# the NaN value has to be properly managed
if 'nan' in irregular_values:
irregular_values.add(np.nan)
irregular_values.remove('nan')
# delete records with wrong values for the given field
self._dataset = self._dataset[~self._dataset[col].isin(irregular_values)]
# all right
resulting_size = len(self._dataset)
print('Cleaning completed: {} records have been deleted'.format(original_size - resulting_size))
return self._dataset
First of all, we instantiate the DataRules class: for the moment, we consider the training dataset and define integrity rules for its features.
# create the dictionary of rules
apartmentRules = DataRules(dataset, 'Id', features_metadata)
Since several features share the same integrity rules, they are divided into logical families and processed together.
# define features integrity rules
# divide columns of the dataset into families, according to the nature of their data
columns = {}
columns['all'] = set(dataset.columns)
# numbers: strictly positive integers
columns['int_non_zero'] = set(['Id', 'LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice'])
apartmentRules.add_rule(columns['int_non_zero'], lambda x: x > 0)
columns['int'] = set(['EnclosedPorch', 'MasVnrArea', 'LotFrontage', '2ndFlrSF', 'PoolArea', 'OpenPorchSF', 'TotalBsmtSF', 'WoodDeckSF', 'ScreenPorch', 'BsmtFinSF1', 'MiscVal', 'GarageArea', 'LowQualFinSF', '3SsnPorch', 'BsmtFinSF2', 'BsmtUnfSF'])
apartmentRules.add_rule(columns['int'], lambda x: x >= 0)
# small numbers: non-negative integers
columns['small_int'] = set(['BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'KitchenAbvGr', 'BedroomAbvGr'])
apartmentRules.add_rule(columns['small_int'], lambda x: x >= 0)
# years
columns['year'] = set(['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold'])
# no useful rules can be defined for this class of fields
# months of the year (numerical)
columns['month'] = set(['MoSold'])
apartmentRules.add_rule(columns['month'], lambda x: (x >= 1) & (x <= 12))
Once the integrity rules have been defined, we can check whether there are any inconsistencies inside the dataset.
apartmentRules.spot_wrong_records()
Now we would like to understand whether some of those errors are recoverable (e.g. inconsistencies with the syntax declared in the documentation) or whether the records need to be deleted.
# let us consider the 'MSZoning' field
print('Current values: {}'.format(dataset['MSZoning'].unique()))
print('Documentation approves: {}'.format(list(features_metadata['MSZoning']['values'].keys())))
It looks like "C" and "C (all)" could actually be the same value: let us check.
features_metadata['MSZoning']['values']['C']['description']
We think they have the same meaning: for this reason, every "C (all)" value inside the dataset will be converted to "C".
dataset['MSZoning'] = dataset['MSZoning'].map(lambda x: 'C' if x == 'C (all)' else x)
# let us consider the 'Neighborhood' field
print('Current values: {}'.format(dataset['Neighborhood'].unique()))
print('Documentation approves: {}'.format(list(features_metadata['Neighborhood']['values'].keys())))
This is clearly a syntax error: according to the documentation, the correct value should be "Names" (instead of "NAmes"). We are going to fix it.
dataset['Neighborhood'] = dataset['Neighborhood'].map(lambda x: 'Names' if x == 'NAmes' else x)
# let us consider the 'BldgType' field
print('Current values: {}'.format(dataset['BldgType'].unique()))
print('Documentation approves: {}'.format(list(features_metadata['BldgType']['values'].keys())))
While '2fmCon' and 'Duplex' are easily recognized as syntax errors, one may argue that 'Twnhs' could refer either to 'TwnhsE' or to 'TwnhsI'. However, we think the intended one is 'TwnhsI', because the other is already present inside the dataset. We proceed with the data cleaning.
d = {
'2fmCon': '2FmCon',
'Duplex': 'Duplx',
'Twnhs': 'TwnhsI'
}
dataset['BldgType'] = dataset['BldgType'].map(lambda x: d[x] if x in d else x)
# let us consider the 'Exterior1st' field
print('Current values: {}'.format(dataset['Exterior1st'].unique()))
print('Documentation approves: {}'.format(list(features_metadata['Exterior1st']['values'].keys())))
Again, the value 'Wd Sdng' is another inconsistency (it should be 'WdSdng'): let us correct it.
dataset['Exterior1st'] = dataset['Exterior1st'].map(lambda x: 'WdSdng' if x == 'Wd Sdng' else x)
# let us consider the 'Exterior2nd' field
print('Current values: {}'.format(dataset['Exterior2nd'].unique()))
print('Documentation approves: {}'.format(list(features_metadata['Exterior2nd']['values'].keys())))
All four of the spotted wrong values are actually inconsistencies:
d = {
'Brk Cmn': 'BrkComm',
'CmentBd': 'CemntBd',
'Wd Sdng': 'WdSdng',
'Wd Shng': 'WdShing'
}
dataset['Exterior2nd'] = dataset['Exterior2nd'].map(lambda x: d[x] if x in d else x)
We can now verify that all the dataset's inconsistency errors have been recovered:
apartmentRules.spot_wrong_records()
We now repeat the procedure for the test set, deleting any unexpected (because clearly wrong) content. The integrity rules to apply are the same as before, so we do not need to instantiate a new rules dictionary:
# assign the rules to the test set
apartmentRules.set_dataset(testset)
# and see if there are wrong records
apartmentRules.spot_wrong_records()
Apart from the "Electrical" feature, these inconsistencies are the same ones spotted before for the dataset. We can then proceed with the cleaning in the same way as before.
# MSZoning
testset['MSZoning'] = testset['MSZoning'].map(lambda x: 'C' if x == 'C (all)' else x)
# Neighborhood
testset['Neighborhood'] = testset['Neighborhood'].map(lambda x: 'Names' if x == 'NAmes' else x)
# BldgType
d = {
'2fmCon': '2FmCon',
'Duplex': 'Duplx',
'Twnhs': 'TwnhsI'
}
testset['BldgType'] = testset['BldgType'].map(lambda x: d[x] if x in d else x)
# Exterior1st
testset['Exterior1st'] = testset['Exterior1st'].map(lambda x: 'WdSdng' if x == 'Wd Sdng' else x)
# Exterior2nd
d = {
'CmentBd': 'CemntBd',
'Wd Sdng': 'WdSdng',
'Wd Shng': 'WdShing'
}
testset['Exterior2nd'] = testset['Exterior2nd'].map(lambda x: d[x] if x in d else x)
Test set records cannot be deleted: we need to keep all of them, because we are required to predict the SalePrice for each one.
So, we have to find a way to recover the inconsistency for the only record that presents "NaN" as a value for this feature.
Note that, had the problem appeared in the training dataset, the record could simply have been deleted.
# the incongruent record
testset[testset['Electrical'].isna()]
# distribution of the existing values for this feature
counts = testset['Electrical'].value_counts()
sns.barplot(x=counts.index, y=counts.values)
# documentation about this feature
PrintFeatureMetadata('Electrical')
It looks like none of the possible values can express that no information is available about the electrical system (or that the system is missing).
We decide to fix the flaw by replacing "NaN" with "FuseF", the most common value for this feature inside the test set.
testset['Electrical'] = testset['Electrical'].fillna('FuseF')
# check that all the inconsistency errors have been recovered
apartmentRules.spot_wrong_records()
Both the dataset and the test set are now compliant with the standard.
# we do not need the rules dictionary anymore
apartmentRules.destroy()
The goal now is to convert the qualitative features (string-type fields, or ratings) into boolean dummy ones.
In other words, each of these features will be converted into as many boolean features as it has possible values described in the documentation. The original (categorical) predictors will be deleted. In order to speed up this process, we define a dedicated function.
def remap_categorical_fields(data):
'''
This function converts several categorical fields into numerical ratings or boolean values.
It requires that a field's categories can be compared with each other (their values can be sorted by weight/importance).
'''
def process_cols(cols_list, dictionary, compulsory=True):
'''
Transform a list of columns (fields) according to the values mapped by the dictionary
'''
for c in cols_list:
if not compulsory:
# create a boolean field to indicate if the value is missing or not
data['{}_missing'.format(c)] = data[c].map(lambda x: x == 'NA')
# remap the values
data[c] = data[c].map(lambda x: dictionary[x])
# "CentralAir" is a boolean field
d = {'Y': 1, 'N': 0}
l = ['CentralAir']
process_cols(l, d)
# "Id" is not of interest
data = data.drop(['Id'], axis=1)
# final result: dummy features
return pd.get_dummies(data)
# transform discrete-value columns as booleans
dataset = remap_categorical_fields(dataset)
# final result
dataset[:6]
# transform discrete-value columns as booleans
testset = remap_categorical_fields(testset)
# final result
testset[:6]
# we are going to modify the data: keep track of the current DataFrames
dataset_complete = dataset.copy()
testset_complete = testset.copy()
Now, the dataset structure is ready and we can focus on analysing our data. The idea is to describe them statistically in order to discover feature properties and/or mutual relations, so that we can understand which model we can implement and how to do it in the best way.
To summarize, we want to understand whether the features can help us predict the SalePrice, which ones matter most, and how they relate to each other.
First of all, we want to understand if at least one of the features can help us to predict the SalePrice (that is, there is a relationship between these two variables).
More specifically, what we are going to check is a linear additive relationship. The way to proceed is to perform a hypothesis test in which the null hypothesis H0 states that no feature is related to the target (all the regression coefficients are zero), while the alternative hypothesis H1 states that at least one coefficient is non-zero.
This test is performed by computing the F-statistic:
$$ F = \frac{(TSS-RSS)\ /\ p}{RSS\ /\ (n-p-1)} $$
after fitting a multiple linear regression model on the data:
$$ \hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^p \hat{\beta}_j x_{ij} $$
where: $$ TSS = \sum_{i=1}^n (y_i - \bar{y})^2 $$
$$ RSS = \sum_{i=1}^n (y_i - \hat{y_i})^2 $$
$$ \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i $$ $$ p = \text{number of dataset's features} $$ $$ n = \text{number of dataset's records} $$
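As a sanity check on synthetic data (the toy regression problem below is hypothetical, not the challenge data), note that the F-statistic above can equivalently be computed from $R^2$, since $R^2 = 1 - RSS/TSS$:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# synthetic linear data: 200 records, 3 features, known coefficients plus noise
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

y_pred = LinearRegression().fit(X, y).predict(X)
RSS = np.sum((y - y_pred) ** 2)
TSS = np.sum((y - y.mean()) ** 2)
# direct definition
F_direct = ((TSS - RSS) / p) / (RSS / (n - p - 1))
# equivalent form: F = (R2 / p) / ((1 - R2) / (n - p - 1))
R2 = r2_score(y, y_pred)
F_from_r2 = (R2 / p) / ((1 - R2) / (n - p - 1))
print(np.isclose(F_direct, F_from_r2))  # True
```

Because the data has a genuine linear relationship, the resulting F-statistic is far above 1, which is exactly the behaviour the test exploits.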
from sklearn import linear_model
def features_and_target(apartments_df, target='SalePrice'):
'''
Split the apartments DataFrame in features (X) and targets (y)
'''
X = apartments_df.loc[:, apartments_df.columns != target]
y = apartments_df[target]
return X, y
def lmr_model_stats(X, Y, verbose=True):
'''
This method can be used to evaluate a multiple linear regression model on the fly and retrieve some performance
statistics about how well it fits the data and explains the variance of the response variable using the features.
Printed information:
- MSE
- RSE
- R2 score
- F-statistics
Input parameters are:
- X: the data on which fit the model (features)
- Y: the targets
'''
# number of records and number of features
n, p = X.shape
# evaluate the model (Multiple Linear Regression)
lr = linear_model.LinearRegression()
model = lr.fit(X, Y)
# retrieve predictions: y^
Y_pred = lr.predict(X)
# compute MSE, RSE and R2 score
mse = sklearn.metrics.mean_squared_error(Y, Y_pred)
rse = np.sqrt(mse*n / (n-p-1))
R2 = sklearn.metrics.r2_score(Y, Y_pred)
# compute F-statistic
y_mean = np.mean(Y)
tmp = Y-y_mean
TSS = np.dot(tmp, tmp)
RSS = mse*n
F = ((TSS-RSS)/p) / (RSS/(n-p-1))
# results
if verbose:
print('MSE = {}'.format(mse))
print('RSE = {}'.format(rse))
print('R2 = {}'.format(R2))
print('F-statistic: {}'.format(F))
else:
return mse, rse, R2, F
# isolate features and target
X, y = features_and_target(dataset)
# evaluate model performance statistics
lmr_model_stats(X, y)
$R^2$ and RSE (Residual Standard Error) are numerical values we can use to score and interpret the results our model can produce. As a general rule, $R^2$ should be as high as possible (its possible values are between 0 and 1, both included) while RSE should be low (its minimum possible value is 0).
They can be used to compare several models, built under different operating conditions, and choose the best one.
Under the null hypothesis, the F-statistic takes values close to 1: the higher it is, the more confident we are in rejecting the null hypothesis. The fact that we obtained a value above 47 suggests that at least one of the dataset's features can be used to predict the target.
We feel that the multiple linear regression could be an interesting model choice, but we need to verify that its underlying assumptions are met.
We know that at least one of the features has a relationship with the response variable. So, the question now is: are all the features really useful?
We expect that only some of the features really matter when predicting the SalePrice; moreover, we are interested in reducing the dimensionality of the problem. The reason is that, by focusing only on the most important predictors, we are able to generate a model that is simpler, less prone to overfitting, and easier to interpret.
The reason we have chosen this method (the fourth one in the article) to perform feature selection is that we don't have to specify the number of desired features a priori. It also lets us see the importance score of each predictor and get a global view of the situation.
from sklearn.ensemble import ExtraTreesClassifier
def feature_importance(X, y, threshold=None, importances=None, method='extraTrees', figsize=(70, 10)):
'''
Feature selection: repeat the analysis several times and average the results.
If a threshold is specified, only the features whose importance is greater than that value are highlighted.
If the importances are already given, the computation is skipped.
"method" selects the technique used to perform the feature selection:
- "extraTrees": use ExtraTreesClassifier
- "gradBoosting": use GradientBoostingRegressor
Returns an array whose i-th value is the importance associated with the i-th feature
'''
if method == 'extraTrees':
N_trees = 100
model = ExtraTreesClassifier(n_estimators=N_trees, bootstrap=True)
elif method == 'gradBoosting':
model = sklearn.ensemble.GradientBoostingRegressor()
else:
raise ValueError('The specified method is invalid. Try "extraTrees" or "gradBoosting".')
# compute importance values
if importances is None:
res = None
n = 25
for i in range(n):
model.fit(X, y)
if res is None:
res = model.feature_importances_
else:
res += model.feature_importances_
res /= n
else:
res = importances
# visualize results
plt.figure(figsize=figsize)
plt.title("Feature importance", fontsize=25, y=1.02)
if threshold is None:
# all the features
sns.barplot(x=X.columns, y=res)
else:
# only the most important ones
colorLight = '#81ecec'
colorDark = '#00b894'
colors = [colorDark if _y >= threshold else colorLight for _y in res]
# plot the bars and the threshold
sns.barplot(x=X.columns, y=res, palette=colors)
plt.plot(X.columns, [threshold]*len(X.columns), 'c-')
plt.xticks(rotation='90')
plt.grid(True)
# return importance values
return res
res = feature_importance(X, y)
We now need to decide which features to keep: the idea is to drop the features whose importance is lower than twice the mean importance value.
m = np.mean(res)
print(m)
colorThreshold = 2*m
_ = feature_importance(X, y, colorThreshold, res)
n_total = len(res)
n_selected = (res > colorThreshold).sum()
w_selected = res[res > colorThreshold].sum()
print('Original number of features: {}'.format(n_total))
print('Selected features: {} ({} %)'.format(n_selected, round(n_selected/n_total*100, 2)))
print('Selected features total weight: {}'.format(round(w_selected, 2)))
# drop the non-interesting features
dataset = dataset.filter(items=np.append(X.columns[res > colorThreshold].values, 'SalePrice'))
X = X.filter(items=X.columns[res > colorThreshold])
Let us recompute the model performance metrics, which describe how well the model fits the data and explains the variance of the target using the predictors.
# evaluate model performance statistics
lmr_model_stats(X, y)
As a consequence of reducing the number of features, the model overfits the data less. Because of that, the MSE and $R^2$ scores are worse: still, filtering a subset of features gives us more meaningful results (the F-statistic is much higher in this case).
The reason for this analysis is that the presence of collinearity (in other words, high correlation between features) can pose problems in the regression context: it makes it difficult to separate out the individual effects of collinear variables on the response (target).
Now, we are going to compute and analyze the correlation matrix between the features.
# compute the correlation matrix
correlation_df = dataset.corr()
# define a mask for the upper triangular values of the matrix (it's symmetric, with the diagonal being 1)
mask = np.zeros_like(correlation_df)
mask[np.triu_indices_from(mask)] = True
# plot the correlation matrix
plt.figure(figsize=[40,40])
_ = sns.heatmap(correlation_df, mask=mask, vmin=-1, vmax=1, square=True, linewidths=.6, \
annot=True, fmt=".3f", cmap="coolwarm", cbar_kws={"shrink": 0.5})
The obtained correlation matrix is huge, but a first look is enough to see that several features are actually correlated with each other (blue and red cells).
Now, let us consider only the correlation between each feature and the target variable.
# prepare the dataframe from the series returned by correlation_df['SalePrice']
sale_price_corr_df = pd.DataFrame(correlation_df['SalePrice']).rename(columns={'SalePrice': 'Correlation'})
sale_price_corr_df['Features'] = sale_price_corr_df.index
# remove SalePrice from the features
sale_price_corr_df = sale_price_corr_df[sale_price_corr_df.Features != 'SalePrice']
# change the color for the highest bars
colorLight = '#7FB3D5'
colorDark = '#0984e3'
colorThreshold = 0.60
colors = [colorDark if abs(_y) >= colorThreshold else colorLight for _y in sale_price_corr_df['Correlation']]
# plot the graph
plt.figure(figsize=[20,8])
plt.title("SalePrice correlation with the other numerical features", fontsize=25, y=1.02)
ax = sns.barplot(x='Features', y='Correlation', data=sale_price_corr_df, palette=colors)
_, _ = plt.xticks(rotation='90')
plt.grid(True)
As we can observe, we have derived some insights from the correlation. Features like OverallQual, GrLivArea and GarageCars have a high linear correlation with the final price: as their values increase, the price increases as well.
Let us focus on the features with a correlation higher (in absolute value) than 0.60.
correlated_features_df = sale_price_corr_df[sale_price_corr_df.Correlation.abs() > colorThreshold].sort_values('Correlation', ascending=False)
correlated_features_df
# correlated features
features = list(correlated_features_df.index)
correlated_df = dataset[features].corr()
# define a mask for the upper triangular values of the matrix (it's symmetric, with the diagonal being 1)
mask = np.zeros_like(correlated_df)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=[12,12])
_ = sns.heatmap(correlated_df, mask=mask, vmin=0.0, vmax=1, square=True, linewidths=.6, \
annot=True, fmt=".3f", cmap="YlGnBu", cbar_kws={"shrink": 0.7})
_, _ = plt.yticks(rotation=0)
_, _ = plt.xticks(rotation=45)
The predictors most correlated with the target are also correlated with each other: potentially, this could mean that some of those features are not really interesting. Some of these variables may seem useful for predicting the target not because they really are, but only because they are correlated with other useful variables.
We would like to get rid of these unnecessary predictors; however, looking at the correlation matrix may not be sufficient to spot them. There are situations in which the collinearity involves three or more features, even if no pair of variables has a particularly high correlation. This situation is known as multicollinearity.
Given the results we have observed, it seems we should perform a collinearity analysis to check that all the requirements for building a linear regression model are really satisfied.
A way to spot the problem is to evaluate the VIF (Variance Inflation Factor) for each feature:
$$ \text{VIF}(\hat{\beta_j}) = \frac{1}{1 - R^2_{X_j | X_{-j}}} $$
where $ R^2_{X_j | X_{-j}} $ is the $ R^2 $ from a regression of $ X_j $ onto all of the other predictors. If this value is close to 1, collinearity is present and the VIF will be large.
The collinearity problem itself can be solved in two ways: dropping one of the problematic variables, or combining the collinear variables together into a single predictor.
However, we are not able to merge our features together: they are quite heterogeneous and they are the result of a previous selection procedure.
So, the idea is to drop those variables whose VIF value turns out to be high.
In general, a predictor is considered "problematic" if its VIF value is higher than 5.
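To make the VIF idea concrete before implementing the analysis, here is a minimal self-contained sketch on synthetic data (not the housing set): a feature that is an almost exact linear copy of another gets a very large VIF, while an independent feature stays close to 1.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy example: x2 is almost a linear copy of x1, so both should get
# a huge VIF, while x3 is independent noise and should stay near 1.
rng = np.random.RandomState(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    'x1': x1,
    'x2': x1 + rng.normal(scale=0.01, size=200),  # collinear with x1
    'x3': rng.normal(size=200),                   # independent
})
vifs = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
for name, v in zip(df.columns, vifs):
    print('{}: VIF = {:.1f}'.format(name, v))
```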
# we now implement a method to manage VIF values, inspired by code found online
# https://stats.stackexchange.com/questions/155028/how-to-systematically-remove-collinear-variables-in-python
# docs: http://www.statsmodels.org/dev/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html
from statsmodels.stats.outliers_influence import variance_inflation_factor
def collinearity_analysis(X, thresh=5, drop=True, verbose=False):
'''
Given a Pandas dataframe ("X"), we evaluate the VIF for every feature.
If "drop" is True, problematic features are directly deleted; otherwise, they are just listed.
Note that a feature is considered problematic if its VIF is greater than "thresh" (which is 5 by default,
as recommended in the documentation for the "variance_inflation_factor" method)
'''
cols = X.columns
variables = np.arange(X.shape[1])
# list of problematic predictors
problems = []
# drop features from "variables" until every VIF value is ok
dropped=True
while dropped:
dropped=False
c = X[cols[variables]].values
vif = [variance_inflation_factor(c, ix) for ix in np.arange(c.shape[1])]
maxloc = vif.index(max(vif))
if max(vif) > thresh:
if drop and verbose:
print('dropping \'' + X[cols[variables]].columns[maxloc] + '\' at index: ' + str(maxloc))
problems.append(X[cols[variables]].columns[maxloc])
variables = np.delete(variables, maxloc)
dropped=True
if drop:
if verbose:
print('Remaining variables:')
print(X.columns[variables])
return X[cols[variables]]
else:
return problems
Here, you can see the F-statistic values obtained for the different thresholds
Let's continue
It could be useful to understand whether the dependency between features and target is linear or not. To check this, we can plot the residuals (between targets and predictions) against the predicted values: if the dependency is not linear, the resulting graph will show a non-linear pattern.
def residual_plot(y_pred, residuals, color="g", order=1):
plt.figure(figsize=(14, 7))
sns.regplot(x=y_pred, y=residuals, color=color, scatter_kws={"s": 80}, order=order, ci=None, truncate=True)
plt.title('Residuals graph', size=18)
plt.xlabel('Predicted SalePrice', size=14)
plt.ylabel('Residual', size=14)
# evaluate the model (Multiple Linear Regression)
from sklearn import linear_model
lr = linear_model.LinearRegression()
model = lr.fit(X, y)
# retrieve predictions: y^
Y_pred = lr.predict(X)
# evaluate the residuals
residuals = y - Y_pred
# plot the residual graph
residual_plot(Y_pred, residuals)
residual_plot(Y_pred, residuals, color='b', order=2)
It is convenient to model the error terms as uncorrelated random variables: to check whether this assumption holds, we can plot the residuals against the dataset samples (residual tracking). If we can discern a pattern in the plotted samples, it means the errors are correlated.
The reason we perform this analysis is that, in case of correlated error terms, the estimated standard errors would underestimate the true ones. Actually, this is a typical problem of time series (which is not our case) and we do not expect to have it here. Still, it could arise because of poor experimental design conditions while collecting the data.
# plot the residuals tracking
plt.figure(figsize=(14,7))
plt.plot(residuals)
plt.title('Residuals tracking', size=18)
plt.xlabel('Dataset sample', size=14)
_ = plt.ylabel('Residual', size=14)
Fine: there is no evidence of correlation between the error terms.
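Besides eyeballing the tracking plot, a standard numeric check is the Durbin-Watson statistic: values close to 2 suggest no first-order autocorrelation, values near 0 indicate strong positive autocorrelation. A minimal sketch on synthetic residuals (not the model's ones):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.RandomState(0)
uncorrelated = rng.normal(size=500)       # i.i.d. residuals
correlated = np.cumsum(uncorrelated)      # a random walk: strongly autocorrelated

dw_ok = durbin_watson(uncorrelated)       # expected to be close to 2
dw_bad = durbin_watson(correlated)        # expected to be close to 0
print('Uncorrelated residuals: DW = {:.2f}'.format(dw_ok))
print('Autocorrelated residuals: DW = {:.2f}'.format(dw_bad))
```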
We want to check that the samples are not affected by heteroscedasticity: that is, the variance of the error components should always be the same (constant).
To check this, we plot the residuals graph again: if the plot has a funnel shape, the problem is present.
# plot the residual graph
residual_plot(Y_pred, residuals, order=2)
# transform the response variable
Y2 = np.sqrt(y)
Y2_pred = np.sqrt(Y_pred)
residuals2 = Y2 - Y2_pred
# plot the residual graph
residual_plot(Y2_pred, residuals2, order=2)
The problem is solved: the plotted residuals now have a more uniform spread.
This is how the plot would have looked using the natural logarithm:
# transform the response variable
Y2_log = np.log(y)
Y2_pred_log = np.log(Y_pred)
residuals2_log = Y2_log - Y2_pred_log
# plot the residual graph
residual_plot(Y2_pred_log, residuals2_log, color='r', order=2)
Before training the model, it is important to spot and remove outliers and high leverage points: in other words, points for which $y_i$ is very far from the value predicted by the model (outliers) or which have unusual values for the features (high leverage points).
Although they may not influence how the linear regression model is built (but they could...), their presence makes the RSE (Residual Standard Error) grow. This means we have problems when we try to evaluate confidence intervals and p-values; the $R^2$ also becomes worse.
They could be the result of a wrong observation, arise because of anomalies, or simply represent phenomena that our model is not able to take into account. In a certain sense, they could be evidence of flaws: so, we need to pay attention when we decide to drop data.
They tend to have a sizable impact on how the regression line is estimated: they could even invalidate the entire fit.
The Python library StatsModels provides a very powerful tool called the influence plot:
# to perform the analysis, we need to recompute the model specifically with StatsModel
lm = sm.OLS(Y2, X).fit()
# now, we can print the Influence plot
fig, ax = plt.subplots(figsize=(80,60))
fig = sm.graphics.influence_plot(lm, criterion="cooks", ax=ax)
plt.grid(True)
# high leverage points: from right to left
hl_points = [
313, 335, 249, 185, 635, 1173, 88, 934, 706, 170, 297, 197, 1031, 1009, 883, 523
]
# outliers
outliers = [
970, 688, 1181, 142, 608, # top-left
583, 496, 1182, # top
714, 431, 560, 1062, 666, 874, 462, 410, 812, 968, 916, 588, 30, 632, # bottom-left
523, #bottom
495 #right
]
As an example, we try to understand why the apartment with id=523 is an outlier.
# a little analysis: the greatest outlier (523)
tmp = pd.options.display.max_columns
pd.options.display.max_columns = 999
print(dataset.iloc[523])
pd.options.display.max_columns = tmp
# 184750
print('Avg SalePrice:')
print('{} (~184750)'.format(round(dataset_complete['SalePrice'].mean(), 2)))
The apartment's SalePrice is about average: but the building looks quite recent and big, with the top "OverallQual"... maybe the price was too low?
# 2007
print('Avg yearBuilt:')
print('{} (< 2007)'.format(round(dataset_complete['YearBuilt'].mean(), 2)))
# 40094
print('Avg LotArea:')
print('{} (< 40094)'.format(round(dataset_complete['LotArea'].mean(), 2)))
# 10
print('Avg OverallQual:')
print('{} (< 10)'.format(round(dataset_complete['OverallQual'].mean(), 2)))
What we can think is that, maybe, we are not considering some important variables because they are not available (e.g. the crime rate in the area where the apartment is located). It could also be a price forced by the economic crisis that started the same year the apartment was sold: the owner, in economic need, may have had to sell it at a very convenient price.
Sometimes, the target depends on the product of two or more features, which means the model we are trying to build is not additive. This effect is called synergy: the response variable is affected by whether a group of features act together (at the same moment) or not.
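A common way to let a linear model capture a synergy effect is to add an explicit interaction (product) feature. A minimal sketch on a toy frame with hypothetical values, reusing two of the feature names mentioned earlier; in the notebook one would add such a column to `dataset` before refitting:

```python
import pandas as pd

# Toy frame (hypothetical values, not rows from the real dataset)
df = pd.DataFrame({
    'GrLivArea': [1500, 2000, 2500],
    'OverallQual': [5, 7, 9],
})
# the interaction term lets the effect of living area depend on quality
df['GrLivArea_x_OverallQual'] = df['GrLivArea'] * df['OverallQual']
print(df['GrLivArea_x_OverallQual'].tolist())  # [7500, 14000, 22500]
```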
# store the original prices
original_target = dataset['SalePrice']
# work with the square root of the prices
dataset['SalePrice_sqrt'] = np.sqrt(dataset['SalePrice'])
dataset = dataset.drop(columns=['SalePrice'])
We get rid of the points we spotted before: they are potentially dangerous because they can significantly alter how the regression model fits the data and reduce the model's performance.
# high leverage points
dataset = dataset[~dataset.index.isin(hl_points)]
# outliers
dataset = dataset[~dataset.index.isin(outliers)]
We have seen that some dependencies are actually non-linear: what we are going to do is featurize the records, creating the squared value of every quantitative feature. Then, we will perform feature selection again to get rid of the features that are not actually useful for predicting the target.
for feature in dataset:
# ignore the target
if feature != 'SalePrice_sqrt':
# consider only quantitative features
if feature in features_metadata and features_metadata[feature]['continuous']:
dataset['{}^2'.format(feature)] = dataset[feature]**2
print('We have now {} features'.format(dataset.shape[1]))
Let us see how the residual plot looks like:
X, y = features_and_target(dataset, target='SalePrice_sqrt')
# evaluate the model (Multiple Linear Regression)
lr = linear_model.LinearRegression()
model = lr.fit(X, y)
# retrieve predictions: y^
Y_pred = lr.predict(X)
# evaluate the residuals
residuals = y - Y_pred
# plot them
residual_plot(Y_pred, residuals)
dataset.shape
The analysis is complete: we now want to restore the "SalePrice" field as our target.
dataset['SalePrice'] = original_target
dataset = dataset.drop(columns='SalePrice_sqrt')
Our starting point will be the Multiple Linear Regression model developed so far. To try to improve its performance, we want to explore other kinds of models and feature selection techniques.
The obtained models will be compared with each other: the best ones will be combined into a composite model able to merge their predictions into a single number.
from sklearn import ensemble, model_selection, decomposition, neighbors, linear_model
To compute and parametrize the alternative models, we will use Spark. Our cluster offers a degree of parallelism equal to 12.
parallelism = 12
# broadcast variables
global Xb
global yb
This function will be used to plot the test errors:
def model_validation_results(y, y_pred, verbose=True):
'''
This method is used to inspect how accurate the predictions of a model are.
In more detail, the method prints some information about the prediction errors and plots three graphs:
- estimates and true values comparison
- absolute errors
- relative errors
'''
# compare predictions and test targets: evaluate the errors
delta = np.abs(y - y_pred)
abs_errors = list(delta)
rel_errors = list(delta / y * 100)
# compute some model quality parameter
mse = round(np.dot(delta, delta)/len(delta), 2)
max_error = round(max(abs_errors), 2)
min_error = round(min(abs_errors), 2)
mean_error = round(np.mean(abs_errors), 2)
max_rel_error = round(max(rel_errors), 2)
min_rel_error = round(min(rel_errors), 2)
mean_rel_error = round(np.mean(rel_errors), 2)
# print results
if verbose:
print('MSE: {}'.format(mse))
print('\nMax absolute error: {}'.format(max_error))
print('Min absolute error: {}'.format(min_error))
print('Avg absolute error: {}'.format(mean_error))
print('\nMax relative error: {}%'.format(max_rel_error))
print('Min relative error: {}%'.format(min_rel_error))
print('Avg relative error: {}%'.format(mean_rel_error))
figsize=(80, 8)
N = len(y)
# estimated vs true values
plt.figure(figsize=figsize)
plt.title('Estimates and True SalePrices comparison', fontsize=22, y=1.02)
plt.plot(y_pred, 'r', label='Estimated prices')
plt.plot(list(y), 'g', label='True prices')
plt.xlabel('Apartment', fontsize=16)
plt.ylabel('Estimated and true SalePrices, in $', fontsize=16)
plt.xlim(0, N)
plt.legend()
plt.grid(True)
# absolute errors obtained
plt.figure(figsize=figsize)
plt.title('SalePrice absolute errors obtained', fontsize=22, y=1.02)
plt.plot(abs_errors)
plt.xlabel('Apartment', fontsize=16)
plt.ylabel('Abs. error, in $', fontsize=16)
plt.xlim(0, N)
plt.grid(True)
# relative errors obtained
plt.figure(figsize=figsize)
plt.title('SalePrice relative errors obtained', fontsize=22, y=1.02)
plt.plot(rel_errors, 'g')
plt.xlabel('Apartment', fontsize=16)
plt.ylabel('Rel. error, in %', fontsize=16)
plt.xlim(0, N)
plt.grid(True)
# return results
return mse, max_error, min_error, mean_error, abs_errors, rel_errors
# build our model
X, y = features_and_target(dataset)
model_mlr = linear_model.LinearRegression()
# fit the model (applying the square root to the target variable)
y_pred = model_selection.cross_val_predict(model_mlr, X, np.sqrt(y))**2
# see the prediction error
_ = model_validation_results(y, y_pred)
# keep track of the features used by this model
features_mlr = X.columns
print(features_mlr)
We think the multiple linear regression model alone would perform quite poorly anyway. For this reason, we try to find some alternative models by exploring the scikit-learn library.
We would like to see whether we can increase the prediction performance by averaging the estimates of several different models. Hence, our next step is exploring the models available in scikit-learn: we will compare them and select the best ones.
Before starting, let's drop the outliers and high leverage points, defining the base dataset versions we are going to work with.
# create a version of the original dataset with squared features
dataset_complete_2 = dataset_complete.copy()
for feature in dataset_complete:
# ignore the target
if feature != 'SalePrice':
# consider only quantitative features
if feature in features_metadata and features_metadata[feature]['continuous']:
dataset_complete_2['{}^2'.format(feature)] = dataset_complete_2[feature]**2
# filter out high leverage points and outliers
dataset_cleaned = dataset_complete.copy()
dataset_cleaned = dataset_cleaned[~dataset_cleaned.index.isin(outliers)]
dataset_cleaned = dataset_cleaned[~dataset_cleaned.index.isin(hl_points)]
# filter out high leverage points and outliers
dataset_cleaned_2 = dataset_complete_2.copy()
dataset_cleaned_2 = dataset_cleaned_2[~dataset_cleaned_2.index.isin(outliers)]
dataset_cleaned_2 = dataset_cleaned_2[~dataset_cleaned_2.index.isin(hl_points)]
def tuning_plot(params, model_scores, model_name, param_name):
'''
Plot a graph in which the obtained model scores are shown, according to its parameter's values
'''
plt.figure(figsize=(15, 8))
plt.title('Tuning the model: {}'.format(model_name), fontsize=22, y=1.02)
plt.plot(params, model_scores)
plt.xlabel(param_name, fontsize=16)
plt.ylabel('Model score', fontsize=16)
plt.grid(True)
We now start to consider the alternatives
# isolate features and target
X, y = features_and_target(dataset_cleaned_2)
# feature importance with GradientBoostingRegressor
res = feature_importance(X, y, method='gradBoosting')
m = np.mean(res)
# keep only interesting features
dataset = dataset_cleaned_2.filter(items=np.append(X.columns[res > 2*m].values, 'SalePrice'))
X, y = features_and_target(dataset)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)
# selected features
features_gbr = X.columns
# learning rates
lrs = np.linspace(1e-3, 1, 100)
# create the instances for the models (Gradient Boosting)
models = list(map(lambda lr: ensemble.GradientBoostingRegressor(learning_rate=lr), lrs))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best learning rate
best_lr_gbr = lrs[np.argmax(scores)]
print('Best learning rate found: {}\n'.format(best_lr_gbr))
# the best model obtained
model_gbr = models[np.argmax(scores)]
# plot the scores, according to the tuning parameter
tuning_plot(lrs, scores, 'Gradient Boosting Regressor', 'Learning rate ($\lambda$)')
print(features_gbr)
# isolate features and targets
X, y = features_and_target(dataset_cleaned_2)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)
# number of principal components
Ns = list(range(1, 21))
# instantiate PCAs
PCAs = list(map(lambda n: decomposition.PCA(n_components=n), Ns))
rdd = sc.parallelize(PCAs, parallelism)
# apply PCA decomposition and score the LMR models
scores = rdd.map(lambda pca: model_selection.cross_val_score(
linear_model.LinearRegression(),
pca.fit_transform(Xb.value),
yb.value
).mean()
).collect()
# find the best number of components
best_n_pca = Ns[np.argmax(scores)]
# the best model obtained (a plain linear regression on the PCA-transformed features)
model_pca = linear_model.LinearRegression()
print('Best N-value found: {}\n'.format(best_n_pca))
# plot the scores, according to the tuning parameter
tuning_plot(Ns, scores, 'PCA decomposition', 'Number of components')
The KNN Regression model is more flexible than the Multiple Linear Regression one: when the relation between features and target is highly non-linear, this method can perform much better. We want to try it and see how it performs.
We are going to work with the dataset restricted to the features selected for the MLR model.
# keep only interesting features
dataset = dataset_cleaned_2.filter(items=np.append(features_mlr, 'SalePrice'))
X, y = features_and_target(dataset)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)
# K values
Ks = list(range(1, 11))
# create the instances for the models (KNN)
models = list(map(lambda k: neighbors.KNeighborsRegressor(n_neighbors=k), Ks))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best K value
best_k_knn = Ks[np.argmax(scores)]
print('Best K value found: {}\n'.format(best_k_knn))
# the best model obtained
model_knn = models[np.argmax(scores)]
# plot the scores, according to the tuning parameter
tuning_plot(Ks, scores, 'K-Nearest Neighbors Regressor', 'N. neighbors (K)')
This technique works in a similar way to the Multiple Linear Regression model, but it adds a penalty term on the magnitude of the coefficients.
The predictions are retrieved according to this formula: $$ \hat{y_i} = \beta_0 + \sum_{j=1}^p \beta_jx_{ij} $$
And the loss function becomes: $$ \sum_{i=1}^n (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^p |\beta_j| $$
where $\beta_j$ are the model coefficients and $\lambda \geq 0$ is the penalty coefficient to be tuned.
We are going to use the dataset with all the features (including the squared values), but without outliers and high leverage points.
X, y = features_and_target(dataset_cleaned_2)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)
# penalty coefficients to try
lrs = np.linspace(0.7, 70, 100)
# create the instances for the models (Lasso)
models = list(map(lambda lr: linear_model.Lasso(alpha=lr), lrs))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best penalty coefficient
best_lr_las = lrs[np.argmax(scores)]
print('Best penalty coefficient found: {}\n'.format(best_lr_las))
# the best model obtained
model_las = models[np.argmax(scores)]
# plot the scores, according to the tuning parameter
tuning_plot(lrs, scores, 'Lasso', 'Penalty coefficient ($\lambda$)')
A technique very similar to Lasso: the difference is that the penalty is now on the squared coefficients, so the coefficients are shrunk towards zero but never set exactly to zero.
So, the loss function now becomes: $$ \sum_{i=1}^n (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^p \beta_j^2 $$
We are going to use the dataset with all the features (including the squared values), but without outliers and high leverage points.
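The practical difference between the two penalties can be seen on a toy example (synthetic data, not the housing set): with the same penalty strength, Lasso drives the coefficients of irrelevant features exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only feature 0 actually matters
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 10))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)
ridge = Ridge(alpha=0.5).fit(X_toy, y_toy)
# Lasso zeroes out the irrelevant coefficients; Ridge keeps them all non-zero
print('Lasso non-zero coefficients:', int(np.sum(lasso.coef_ != 0)))
print('Ridge non-zero coefficients:', int(np.sum(ridge.coef_ != 0)))
```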
X, y = features_and_target(dataset_cleaned_2)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)
# penalty coefficients to try
lrs = np.linspace(0.7, 70, 100)
# create the instances for the models (Ridge)
models = list(map(lambda lr: linear_model.Ridge(alpha=lr), lrs))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best penalty coefficient
best_lr_rid = lrs[np.argmax(scores)]
print('Best penalty coefficient found: {}\n'.format(best_lr_rid))
# the best model obtained
model_rid = models[np.argmax(scores)]
# plot the scores, according to the tuning parameter
tuning_plot(lrs, scores, 'Ridge Regression', 'Penalty coefficient ($\lambda$)')
We are now starting to explore ensemble methods: the first one we analyse is the Bagging Regression, built combining together Decision Tree Regressors.
Bagging's objective is trying to reduce the final model's error variance: the base regressors (Decision Trees, in our case) work on different portions of the dataset, fitting their own model. When performing the prediction task, all the predictions coming from these models are averaged together.
We are going to use the dataset with all the features (including the squared values), but without outliers and high leverage points.
X, y = features_and_target(dataset_cleaned_2)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)
# number of base regressors to use
Ns = list(range(50, 5000, 100))
# create the instances for the models (Bagging)
models = list(map(lambda n: ensemble.BaggingRegressor(bootstrap=True, n_estimators=n), Ns))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best number of base regressors
best_n_bag = Ns[np.argmax(scores)]
print('Best N-value found: {}\n'.format(best_n_bag))
# the best model obtained
model_bag = models[np.argmax(scores)]
# plot the scores, according to the tuning parameter
tuning_plot(Ns, scores, 'Bagging Regressor', 'Number of base regressors (n)')
We now consider the AdaBoost Regressor, also built by combining Decision Tree Regressors.
AdaBoost is a famous boosting technique whose objective is to reduce the final model's error. It is somewhat similar to bagging, but here the trees are built sequentially. The shrinkage parameter ($\lambda$) has to be tuned properly.
We are going to use the dataset with all the features (including the squared values), but without outliers and high leverage points. The number of base estimators will be the same as the one found for the Bagging Regressor.
X, y = features_and_target(dataset_cleaned_2)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)
# learning rates to try
lrs = np.linspace(1e-3, 2, 100)
# create the instances for the models (AdaBoost)
models = list(map(lambda lr: ensemble.AdaBoostRegressor(learning_rate=lr, n_estimators=best_n_bag), lrs))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best learning rate
best_lr_ada = lrs[np.argmax(scores)]
print('Best learning rate found: {}\n'.format(best_lr_ada))
# the best model obtained
model_ada = models[np.argmax(scores)]
# plot the scores, according to the tuning parameter
tuning_plot(lrs, scores, 'AdaBoost Regressor', 'Learning rate ($\lambda$)')
This method uses a forest (a set of Decision Trees) to perform the regression: the base regressors' predictions are averaged together.
We are going to use the dataset with all the features (including the squared values), but without outliers and high leverage points. The tuning parameter is the number of base regressors to instantiate.
X, y = features_and_target(dataset_cleaned_2)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)
# numbers of base regressors to try
Ns = list(range(10, 10000, 100))
# create the instances for the models (Random Forest)
models = list(map(lambda n: ensemble.RandomForestRegressor(bootstrap=True, n_estimators=n), Ns))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best number of base regressors
best_n_rfr = Ns[np.argmax(scores)]
print('Best N-value found: {}\n'.format(best_n_rfr))
# the best model obtained
model_rfr = models[np.argmax(scores)]
# plot the scores, according to the tuning parameter
tuning_plot(Ns, scores, 'Random Forest Regressor', 'Number of base regressors (n)')
A technique similar to the Random Forest one: the difference is how the dataset is split when assigned to the base regressors. Here, the split thresholds for the candidate features are chosen more randomly.
We are going to use the dataset with all the features (including the squared values), but without outliers and high leverage points. The tuning parameter is the number of base regressors to instantiate.
X, y = features_and_target(dataset_cleaned_2)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)
# numbers of base regressors to try
Ns = list(range(10, 10000, 100))
# create the instances for the models (Extra Trees)
models = list(map(lambda n: ensemble.ExtraTreesRegressor(bootstrap=True, n_estimators=n), Ns))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best number of base regressors
best_n_etr = Ns[np.argmax(scores)]
print('Best N-value found: {}\n'.format(best_n_etr))
# the best model obtained
model_etr = models[np.argmax(scores)]
# plot the scores, according to the tuning parameter
tuning_plot(Ns, scores, 'Extremely Randomized Trees Regressor', 'Number of base regressors (n)')
Our model alternatives are ready to be compared with each other. To do this efficiently, we create a wrapper class called CompareModels, whose primary goal is to plot the $R^2$ cross-validation score of each model.
The higher the score, the better the model performs at predicting the apartments' SalePrices.
class CompareModels(object):
'''
This class can be used as a wrapper to compare alternative models.
The comparison can be done quickly by plotting the scores for each model.
'''
def __init__(self, X, y, n_batch=3):
'''
Instantiate the wrapper, assigning it the dataset (split into X and y)
and deciding the number of folds to use during cross-validation
'''
self.X = X
self.y = y
self.n_batch = n_batch
self.models = []
def add_model(self, model, name, f_X=None, f_y=None):
'''
Attach one model to the wrapper. You need to provide both the model instance
and the model name to display in the graph.
Optionally, you can specify the functions to be applied a priori:
- f_X, on predictors
- f_y, on targets
'''
# if not specified, the two functions default to the identity function (no input transformation)
if f_X is None:
f_X = lambda x: x
if f_y is None:
f_y = lambda y: y
self.models.append(
{
'name': name,
'instance': model,
'f_X': f_X,
'f_y': f_y,
'score': None
}
)
def score(self):
'''
Score every model assigned to the class performing cross-validation.
Models with highest performances are the ones with highest scores (bars) inside the plot.
'''
# compute scores
for m in self.models:
# validate the model with cross-validation
m['score'] = sklearn.model_selection.cross_val_score(m['instance'], m['f_X'](self.X), m['f_y'](self.y), cv=self.n_batch).mean()
# plot results
plt.figure(figsize=(15, 8))
plt.title('Model comparisons', fontsize=22, y=1.02)
sns.barplot(
x = list(map(lambda m: m['name'], self.models)),
y = list(map(lambda m: m['score'], self.models))
)
plt.xlabel('Regression model', fontsize=16)
plt.ylabel('Cross-validation $R^2$ score', fontsize=16)
plt.xticks(rotation=90)
plt.grid(True)
# methods to apply in order to transform the dataset
def f_X_mlr(X):
return X.filter(items=features_mlr)
def f_X_gbr(X):
return X.filter(items=features_gbr)
def f_X_pca(X):
return sklearn.decomposition.PCA(n_components=best_n_pca).fit_transform(X)
def f_X_knn(X):
return X.filter(items=features_mlr)
# instantiate the wrapper
X, y = features_and_target(dataset_cleaned_2)
myModels = CompareModels(X, y)
# add the previous model alternatives to the wrapper
myModels.add_model(model_mlr, 'Multiple Linear Regression', f_X=f_X_mlr, f_y=np.sqrt)
myModels.add_model(model_gbr, r'Gradient Boosting ($\lambda$={})'.format(best_lr_gbr), f_X=f_X_gbr)
myModels.add_model(model_pca, 'PCA + Multiple Linear Regression', f_X=f_X_pca)
myModels.add_model(model_knn, 'K-Nearest Neighbor (K={})'.format(best_k_knn), f_X=f_X_knn)
myModels.add_model(model_las, r'Lasso ($\lambda$={})'.format(best_lr_las))
myModels.add_model(model_rid, r'Ridge ($\lambda$={})'.format(best_lr_rid))
myModels.add_model(model_bag, 'Bagging (n={})'.format(best_n_bag))
myModels.add_model(model_ada, r'AdaBoost (n={}, $\lambda$={})'.format(best_n_ada, round(best_lr_ada, 4)))
myModels.add_model(model_rfr, 'Random Forest (n={})'.format(best_n_rfr))
myModels.add_model(model_etr, 'Extra Trees (n={})'.format(best_n_etr))
# score the models and compare them
myModels.score()
# see the top-6 models and their scores
for m in sorted(myModels.models, key=lambda x: x['score'], reverse=True)[:6]:
print('{}: {}'.format(m['name'], m['score']))
Here we are: the six best-performing models are:
We have analysed several regression models: we created them and selected only the best-performing ones. The next step is to merge these models together into the final model we will use for SalePrice predictions.
The final prediction will be a weighted average of the predictions given by the base regressors we have chosen. To decide the weights, the idea is to use the base predictions themselves as predictors for the target.
This way, we can see which of these models has the greatest impact on the final prediction.
The easiest way to proceed would be to simply average these values, assigning every model the same weight. We decided, however, to follow a more general approach.
In our code, the weights are the feature importances evaluated using the Extra Trees class from scikit-learn.
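As a minimal sketch of this weighting idea (entirely synthetic data and illustrative base models, not the notebook's actual ones): out-of-fold predictions from each base regressor become the columns of a new predictor matrix, and the Extra Trees feature importances, which sum to one by construction, serve directly as blending weights.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import cross_val_predict

# synthetic data: a linear signal plus noise (for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -1.0, 2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=300)

# out-of-fold predictions of each base model become the new predictors
preds = np.column_stack([
    cross_val_predict(LinearRegression(), X, y, cv=3),
    cross_val_predict(Lasso(alpha=5.0), X, y, cv=3),  # deliberately over-regularized
])

# feature importances (which sum to 1) act as the blending weights
weights = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(preds, y).feature_importances_
blended = preds @ weights  # the weighted-average ensemble prediction
```

Because the second model is heavily over-regularized, its predictions carry little information about the target, so the importances assign it a much smaller weight than the first model.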
preds_as_X = pd.DataFrame({
'Ridge': sklearn.model_selection.cross_val_predict(model_rid, X, y),
'MLR': sklearn.model_selection.cross_val_predict(model_mlr, f_X_mlr(X), np.sqrt(y))**2,
'GradBoost': sklearn.model_selection.cross_val_predict(model_gbr, f_X_gbr(X), y),
'Lasso': sklearn.model_selection.cross_val_predict(model_las, X, y),
'Bagging': sklearn.model_selection.cross_val_predict(model_bag, X, y),
'ExtraTrees': sklearn.model_selection.cross_val_predict(model_etr, X, y)
})
# compute the weights as feature importances
final_model_weights = feature_importance(preds_as_X, y, figsize=(15, 8))
print(final_model_weights)
This model will be realized as a class called ApartmentPricePredictor.
class ApartmentPricePredictor(object):
'''
This class is used to instantiate a model that predicts SalePrices. It is composed of six base regressors,
whose predictions are averaged together to give the final result.
These are the six best-performing models we discovered above.
WARNING: not intended to work as a stand-alone class!
Internal values refer to variables whose values were computed earlier (e.g. the optimal models).
'''
def __init__(self):
'''
The six internal base regressors are created, using previously computed results
'''
# flag indicating whether the model has already been trained
self.trained = False
# Bagging
self.bag = {
'instance': model_bag,
'model': None,
'weight': final_model_weights[0]
}
# Extra Trees Regressor
self.etr = {
'instance': model_etr,
'model': None,
'weight': final_model_weights[1]
}
# Gradient Boosting Regressor
self.gbr = {
'instance': model_gbr,
'model': None,
'weight': final_model_weights[2]
}
# Lasso
self.las = {
'instance': model_las,
'model': None,
'weight': final_model_weights[3]
}
# Multiple Linear Regression model
self.mlr = {
'instance': model_mlr,
'model': None,
'weight': final_model_weights[4]
}
# Ridge
self.rid = {
'instance': model_rid,
'model': None,
'weight': final_model_weights[5]
}
def train(self, X, y):
'''
Train the model over the specified dataset
'''
# Bagging
self.bag['model'] = self.bag['instance'].fit(X, y)
# Extra Trees
self.etr['model'] = self.etr['instance'].fit(X, y)
# Gradient Boosting: select a subset of the features
self.gbr['model'] = self.gbr['instance'].fit(f_X_gbr(X), y)
# Lasso
self.las['model'] = self.las['instance'].fit(X, y)
# MLR model: selects a subset of the features and fits the square root of the SalePrice
self.mlr['model'] = self.mlr['instance'].fit(f_X_mlr(X), np.sqrt(y))
# Ridge
self.rid['model'] = self.rid['instance'].fit(X, y)
# mark the model as trained
self.trained = True
def predict(self, X):
'''
Run predictions using the trained model
'''
# the model must have been trained
if not self.trained:
raise RuntimeError('The model has never been trained before!')
# evaluate the predictions
def _pred(m, X, f_X=None, f_y=None):
'''
Internally used to evaluate the model prediction "p", over the predictors "X".
If you need to manipulate the predictions, pass a function to the "f_y" parameter
'''
X_ = X if f_X is None else f_X(X)
p = m['model'].predict(X_)
if f_y is not None:
p = f_y(p)
return p * m['weight']
# Bagging
bag_pred = _pred(self.bag, X)
# Extra Trees
etr_pred = _pred(self.etr, X)
# Gradient Boosting: select a subset of the features
gbr_pred = _pred(self.gbr, X, f_X=f_X_gbr)
# Lasso
las_pred = _pred(self.las, X)
# MLR model: it works predicting the square root of the price
mlr_pred = _pred(self.mlr, X, f_X=f_X_mlr, f_y=lambda y: y**2)
# Ridge
rid_pred = _pred(self.rid, X)
# merge these predictions
return np.sum(np.array([bag_pred, etr_pred, gbr_pred, las_pred, mlr_pred, rid_pred]), axis=0)
def cross_val_predict(self, X, y, cv=3):
'''
Run out-of-fold predictions with cross-validation (the internal regressors are re-fitted on each fold)
'''
# evaluate the predictions
def _pred(m, X, y, f_X=None, f_y_priori=None, f_y_posteriori=None):
'''
Internally used to evaluate the model predictions over the predictors "X".
To transform the targets before fitting, pass "f_y_priori"; to transform the resulting predictions, pass "f_y_posteriori"
'''
# if not specified, "f"s are the identity functions
X_ = X if f_X is None else f_X(X)
y_ = y if f_y_priori is None else f_y_priori(y)
# perform cross validation to make predictions
res = sklearn.model_selection.cross_val_predict(m['instance'], X_, y_, cv=cv)
if f_y_posteriori is not None:
res = f_y_posteriori(res)
# final result
return res * m['weight']
# Bagging
bag_pred = _pred(self.bag, X, y)
# Extra Trees
etr_pred = _pred(self.etr, X, y)
# Gradient Boosting: select a subset of the features
gbr_pred = _pred(self.gbr, X, y, f_X=f_X_gbr)
# Lasso
las_pred = _pred(self.las, X, y)
# MLR model: it works predicting the square root of the price
mlr_pred = _pred(self.mlr, X, y, f_X=f_X_mlr, f_y_priori=np.sqrt, f_y_posteriori=lambda y: y**2)
# Ridge
rid_pred = _pred(self.rid, X, y)
# merge these predictions
return np.sum(np.array([bag_pred, etr_pred, gbr_pred, las_pred, mlr_pred, rid_pred]), axis=0)
def validate(self, X, y, cv=3, verbose=True):
'''
Perform the cross-validation to validate the model and print obtained results
'''
# perform predictions
y_pred = self.cross_val_predict(X, y, cv=cv)
# compare predictions and test targets: evaluate the errors
return model_validation_results(y, y_pred)
X_train, y_train = features_and_target(dataset_cleaned_2)
final_model = ApartmentPricePredictor()
_ = final_model.validate(X_train, y_train)
Our experiment produced a nice result: the final model outperforms the Multiple Linear Regression alone. We succeeded in reducing both the model's MSE and its average errors.
We can now make predictions on the given test set.
Before predicting, we must align the test-set features with those used during training: keep only the training columns, regenerate the quadratic features, and set to zero any dummy columns missing from the test set.
# the test set may have features that do not exist in the training set
X_test = testset_complete.filter(items=dataset_complete_2.columns)
# generate quadratic features for the test set
for feature in X_test:
# ignore the target
if feature != 'SalePrice':
# consider only quantitative features
if feature in features_metadata and features_metadata[feature]['continuous']:
X_test['{}^2'.format(feature)] = X_test[feature]**2
# conversely, some features present during the training phase may be missing from the test set
missing_features = set(dataset_complete_2.columns) - set(X_test.columns) - set(['SalePrice'])
print('{} missing features:'.format(len(missing_features)))
print(missing_features)
All of these features are dummy variables generated from qualitative predictors, so it is safe to set them all to zero.
# set dummy features to zero
for feature in missing_features:
X_test[feature] = 0
# adapt the DataFrame schema: reorder the test columns to match the training ones
X_test = X_test[X_train.columns]
# check the columns are in the correct order
assert np.all(X_test.columns == X_train.columns)
We can use the entire dataset as the training set: predictions will then be made on the test set.
# train the model
final_model.train(X_train, y_train)
# retrieve the predictions
final_result = np.round(final_model.predict(X_test))
print(final_result[:5])
As required, we save our predictions as a CSV file with header (Id, SalePrice).
pd.DataFrame({
'Id': testset_Ids,
'SalePrice': final_result
}).to_csv('final_predictions.csv', index=False)
Although a lot has been considered and analysed, several other techniques could have been tested and explored. We did not have enough time to perform the synergy-effects analysis, or to try other configurations of the alternative models we developed.
For example, we could have investigated what happens when keeping only a subset of the features for models like the Extra Trees Regressor, or whether some of the scikit-learn solutions also work well in the presence of outliers and high-leverage points.
Developing this notebook has been an opportunity for us to learn a lot. We are starting to feel much more confident when facing complex machine learning problems. Thank you for the effort you put into setting up this challenge for us.